Introduction to R and RStudio
Introduction
General information
A programming language is a communication code between a human and a machine (usually a computer). It allows you to give instructions to the computer. And the computer is very obedient: it blindly carries out each instruction that we give it.
There is a huge number of programming languages, and they keep evolving.
R is a programming language that allows you to:
- manipulate data: import, transform, export, etc.
- carry out more or less complex statistical analyses: description, exploration, modeling…
- create (pretty) figures
- and much more !
Features:
- available on mainstream operating systems (Windows, Mac and Linux).
- free and open source (GNU GPL licence).
- large user community / online help.
- numerous additional packages for any purpose.
History:
- 1993/08 : first official release of R as a binary.
- Written by Ross Ihaka and Robert Gentleman, simply aimed at being a programming language to teach introductory statistics at the University of Auckland.
- Inspired by (and partially compatible with) the pre-existing S programming language (Bell Labs, TIBCO Software).
- 1997/04 : foundation of the Comprehensive R Archive Network, as the R central packages repository.
- 1997/12 : release of R v0.6.0 sources under the GNU licence.
- 2000/02 : release of R v1.0.0.
- 2024/04 : release of R v4.4.0.
Details about implementations of R language
WARNING : R is not R ! The R software you will use is NOT the R language itself ! It is an implementation of the language, and other implementations co-exist(ed). For example : Revolution Analytics (by Microsoft, now defunct), Renjin (active, Java), FastR (active, Java), Riposte (defunct), CXXR (defunct, C++) …
R versus Excel:
Fun fact (more awkward than fun, actually) about why R is better than Excel:
In 2020, “Covid: UK loses thousands of cases due to … a saturated MS Excel file”
More information comparing R and Excel here.
The RGui
After installing R, you can launch it by double-clicking on the R icon.
An interface is software placed between you and the computer that allows you to communicate more easily with the computer.
As you can see, the default R interface (RGui console : GUI stands for Graphical User Interface) is quite abrupt and bleak :S …
As such, it is strongly recommended to use additional software that will function as a graphical interface between you and R. This interface is a kind of shell that makes R work in the background. Several graphical interfaces have been developed, but the most used and practical one is RStudio Desktop.
RStudio
First sight
RStudio displays 4 large panes by default. Their position may change depending on your version and preferences, but here is the default layout :
| | LEFT | RIGHT |
|---|---|---|
| UPPER | Script pane | Environment/History pane |
| LOWER | Console pane | Help/Plots/Files |
NOTE : your four panes may be blank while most of mine are filled with text in the illustration, do not panic ! We’ll come back to that later.
The console pane (lower left)
This is a simple interactive R console, like in the RGui (see previous section), that allows us to communicate with the computer.
WARNING: Here is an example provided by the bioinformatics platform BiGR. Your local RStudio might differ: the version of R, the list of available packages, etc. On your local machine, RStudio console will match with the RGui.
Here, I gave an instruction to R, and it told me with an “Error” message that it was unable to execute it. We communicate.
Let’s try to enter the print() command :
print("Hello World")
[1] "Hello World"
Here, we just used a function (which is an instruction), called print.
This function attempts to display on screen anything provided between parentheses ( and ). In our example, we provided the character string "Hello World" (thus, we used quotes), and the function print successfully printed it on screen !
The pieces of information put between the parentheses of a function are called arguments.
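As a small illustration of arguments, the sketch below calls two base R functions, paste() and round(), with more than one argument. The variable names (greeting, rounded_pi) are made up for this example.

```r
# paste() glues its arguments together; 'sep' is a named argument
# that sets the separator placed between the pieces
greeting <- paste("Hello", "World", sep = "_")
print(greeting)

# round() takes the number to round and the number of decimals to keep
rounded_pi <- round(3.14159, digits = 2)
print(rounded_pi)
```

Naming the arguments (sep = ..., digits = ...) makes the call easier to read than relying on their position.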
Now, click on Session -> Save Workspace As… and name it as you wish (I named mine “my_session.RData”). This will save your current workspace (the environment in which you are working : the custom objects and functions loaded into memory). What happened in the console pane? You nailed it! A command has been automatically written. I named the file “my_session.RData”, so for me, it is:
save.image("my_session.RData")
This is one of the many ways RStudio helps you in your work, by simplifying (mainly using clicks and buttons) some tedious, recurring tasks.
As a general recommendation, you should save your workspace after any important step in your code. When you need help, whether on a function error, a script result or anything alike, you will be able to share this saved file with your favorite R developer next door. This file contains every object created in your current session.
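To see what saving and restoring looks like in code, here is a minimal sketch. It uses save(), which stores only the objects you name (save.image() stores the whole workspace), and load() to bring them back. The object name important_result and the file name are assumptions made for this example, and the file is written to a temporary directory.

```r
# Hypothetical file path, placed in a temporary directory for this example
saved_file <- file.path(tempdir(), "important_result.RData")

important_result <- 42                       # an object worth keeping
save(important_result, file = saved_file)    # write it to disk

rm(important_result)                         # remove it from memory
load(saved_file)                             # restore it from the file
print(important_result)
```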
Note that you can create your own functions in the future.
A package is a collection of R functions that are not natively present in R.
NOTE: The RStudio console allows auto-completion and code suggestion : when one uses the [TAB] key after writing the first few characters of a function or variable name, the console proposes possible names from a dictionary that fits your installation. Any reputable bioinformatician uses auto-completion, but shhh ! It’s one of our many secrets !
The environment/history pane (upper right)
This pane has three main tabs: ‘Environment’, ‘History’ and ‘Connections’ (the ‘Tutorial’ tab was recently added, and a ‘Git’ tab may appear when one uses code in a version-controlled environment).
Environment
‘Environment’ displays the list of every single variable, custom function, object or data loaded in R. This includes only what you defined by yourself and does not include environment variables nor base nor package-loaded functions.
A variable allows you to store information in memory, by giving it a name (like a box containing a piece of information, with a label on it). An object is a more complex variable that allows you to save several pieces of information under one name (see R - Basics section).
Example : in your console pane, enter the following command:
my_var <- 0 # May also be written my_var = 0
What happened in the ‘Environment’ pane ?
You nailed it : a new my_var variable is now available in your environment !
When a more complex object is declared in your workspace, some general information may be displayed, too. For example :
df <- data.frame("a"=c(1, 3), "b"=c(2, 4))
You can observe this dataframe (something like a table, we’ll see it later).
Click on its name to get a preview of its contained data. Then, click on the light-blue downwards arrow to have a deeper insight of its content:
Now click on Session -> Clear Workspace…
… and watch how your objects/variables disappeared !
This action cannot be undone ! While it is useful to clear one’s workspace from time to time (in order to avoid name collisions, release a bit of used RAM, etc.), it is much safer to save your workspace beforehand.
Once again, using this “clickable” way is just an assist from RStudio : there are specific (written) commands to remove objects, with more manual control.
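The written commands for removing objects can be sketched as follows, using the base R functions rm() and exists(). The two variable names are made up for this example.

```r
first_value <- 1
second_value <- 2

# Remove a single object by name; the other one is left untouched
rm(first_value)

print(exists("first_value"))    # FALSE: the object is gone
print(exists("second_value"))   # TRUE: still in memory

# rm(list = ls()) would remove every object of the workspace at once
# (left commented out here on purpose)
```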
History
This tab is essential when coding and testing : while you test and search in the console, the history keeps track of each command line you typed. This will definitely help you build your scripts, pass your command lines to coworkers, try and track code variations, and revert unfortunate errors.
Each history is related to a session. You may see many commands in your history, even some you never entered manually : when an RStudio “assist” menu/button is used (e.g. knitting commands, display commands, help commands, etc.), in most cases it runs an R command that is stored in the history just like the ones you entered manually.
Please note that your history has a limit (512 entries by default) and only keeps the latest command lines. The default value can be changed (but this is best left to more advanced users).
The help/plots/packages/files pane (lower right)
This pane has four main tabs: ‘Files’, ‘Plots’, ‘Packages’ and ‘Help’ (the ‘Viewer’ tab is rarely used, and ‘Presentation’ was added very recently).
Help
This is maybe the most important/useful pane of RStudio, from a user’s point of view. THIS is the difference between RStudio and another code editor. Search for any function here, locally and not on the internet. This pane shows you the available help for YOUR version of R, YOUR version of a given package.
Indeed, a published function that has a help page in one version of R or of a package can still evolve over time, depending on its author. Parameters can be added, removed or renamed, and default values can change. Consequently, the help from a remote source (internet) may not match your installed version !
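Since help pages depend on versions, it can be handy to check which versions are installed locally. The sketch below relies on R.version.string and packageVersion(), both part of a standard R installation; the ‘stats’ package is only used as an example because it is shipped with R.

```r
# Version of R currently running
print(R.version.string)

# Version of an installed package (here 'stats', shipped with R)
stats_version <- packageVersion("stats")
print(stats_version)

# help("mean") or ?mean would then open the matching local help page
```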
When copy-pasting commands from a remote source, please make sure the source is clean (i.e. nothing harmful for your system) and compatible with your installed version (this is a major source of errors).
Never ever copy code from the internet straight into your console. Why? Example: https://www.wizer-training.com/blog/copy-paste
Files
Just like any system file explorer, we can move across directories, create folders and files, delete them, etc. from within RStudio.
Initially, this file explorer shows your current working directory. This means that when you refer to directories or files without giving an absolute path, R looks for them relative to this working directory.
NOTE : this is not a theoretical representation : performing modifications here will modify the file structure on your drive !
Create a directory
Here, we will create a new directory, using two methods :
- With the GUI
- With a command line :
Use the dir.create() function :
dir.create("Intro_R")
Change working directory
One can change one’s working directory :
- With the GUI :
- With a command :
Use setwd():
setwd("Intro_R")
Delete files
You can delete files:
- With the GUI
- With a command
Use the file.remove() function:
file.remove("annotation.csv")
file.remove("expression.txt")
NOTE: There is no “Recycle bin” here ! Deleted files or directories are gone for good !
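The file commands of this section can be tried safely with the sketch below. It works inside a temporary directory so that nothing important is touched; the folder and file names are assumptions made for this example.

```r
# Hypothetical sandbox folder inside R's temporary directory
sandbox <- file.path(tempdir(), "Intro_R_sandbox")
dir.create(sandbox, showWarnings = FALSE)
print(dir.exists(sandbox))

# Create an empty file, list the folder content, then delete the file
demo_file <- file.path(sandbox, "demo.txt")
file.create(demo_file)
print(list.files(sandbox))
file.remove(demo_file)
print(file.exists(demo_file))

# getwd() shows the current working directory (not changed by this sketch)
print(getwd())
```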
Packages
All locally installed packages are listed here (not all packages available for download), with a description and a version number.
You will get more information about packages in an upcoming subsection.
Plots
If you work with R scripts that produce plots, the generated graphs will be displayed here !
These can sometimes be interactive plots (ability to zoom, scroll, etc…), depending on the functions used.
The script pane (upper left)
This is where you will spend most of your time when writing your R scripts (next to the ‘Help’ pane !).
A script is a text file in which you save your command lines, in the order they should be run.
This pane also accepts other languages (e.g. bash, python, …) or different R flavors (R Markdown, for example), but RStudio shines for its R integration, obviously.
Please, please ! Write your commands in the ‘Script’ pane, then execute them (hitting [CTRL] + [Enter] with the cursor placed on the command to run), rather than directly writing them in the console ! This has only advantages : you can track, save, share your work, test variations, without relying on your (spoiler : imperfect) memory, …
The default file extension for a saved R script is .R or .r (or .RMD, .Rmd for R Markdown scripts), for example my_script.R.
NOTE : The file name here appears in red and ending with a star ’*’ because some content was written in the script but not saved on disk.
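A saved script can also be run as a whole with the base R function source(). The sketch below writes a tiny two-line script to a temporary file and runs it; the script content and the variable name script_total are made up for this example.

```r
# Hypothetical script file, written to a temporary directory
script_path <- file.path(tempdir(), "my_script.R")
writeLines(
  c("script_total <- 1 + 2",
    "print(script_total)"),
  script_path
)

# Run every command of the script, in order
source(script_path)
```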
TLDR – Too Long Didn’t Read
Graphic interface presentation :
- Write command lines in the ‘Script’ pane (upper left)
- Execute command lines by hitting [CTRL] + [Enter] from the script pane, then observe their output in the ‘Console’ / ‘Plots’ panes.
- Take a look at your environment and history in the upper right pane
- Search for help in the lower right pane.
R – Basics
Variables and types
Remember : a variable is a name given to any value of any type stored in memory.
Here, we will describe basic types.
A type is a data type. It can be an integer, a decimal number, a string,
an array, etc.
Number
For example, 3, the number three, exists in R and is understood as such by R. You can store this value in a variable using the arrow assignment operator <-:
three <- 3
In the above code, the number 3 is stored in a variable called “three” (it could be any other name, as long as it avoids special characters).
You can do this in R with almost anything: whole files, pipelines, images, anything.
# The example below is a very good example of
# how to never ever name a variable:
シ <- "happy"
Maths in R works the same as on your regular calculator :
3 + three # Add
[1] 6
1 - 2 # Subtract
[1] -1
4 / 2 # Divide
[1] 2
3 * 4 # Multiply
[1] 12
7 %/% 2 # Floor division
[1] 3
Info:
# is the way to write a comment into your script. Anything written after #, up to the end of the line, will be ignored by R. For example, in 3 + three # Add, 3 + three will be executed, but # Add will not. A good coder uses comments, a lot of them, and relevant ones, to explain the calculations made !
NOTE : Here, we simplified how R is actually handling numbers. There are two main subtypes in numbers : 1) Integers 2) Numerics (called “floats” in most other languages).
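The two number subtypes can be told apart with class(). In this short sketch, the L suffix is the standard R way to ask for an integer; the variable names are made up for the example.

```r
whole_number <- 3L    # the L suffix asks for an integer
decimal_number <- 3   # without suffix, R stores a numeric (double)

print(class(whole_number))     # "integer"
print(class(decimal_number))   # "numeric"

# Both are considered numbers, and their values are equal
print(is.numeric(whole_number))           # TRUE
print(whole_number == decimal_number)     # TRUE
```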
Character
Characters correspond to letters and strings of letters, and are delimited with quotes, either single ' or double " (as long as the same type is used for a pair : one cannot start a character string with a single quote and end it with a double one !) :
four <- "4"
five <- '5'
Mathematics does not work with characters at all … Try the following:
"4" + 1
four + 1
Computer answer
Error in "4" + 1 : non-numeric argument to binary operator
You can try to turn characters into numbers with the function as.numeric():
as.numeric("4") + 1
[1] 5
as.numeric(four) + 1
[1] 5
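The conversion only works when the text actually looks like a number. As a quick sketch of what happens otherwise: as.numeric() returns NA (a missing value) and emits a warning, which is silenced here with suppressWarnings() to keep the output short.

```r
# "four" is text that cannot be read as a number
not_a_number <- suppressWarnings(as.numeric("four"))

print(not_a_number)          # NA
print(is.na(not_a_number))   # TRUE
```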
Function
A function is a piece of R code that can perform more complex work. It is called using its name, followed by parentheses ( and ). Between these parentheses, we enter the arguments that are expected by the function.
Use the help pane to get information about the names of arguments expected and/or understood by a given function, and the expected type of their value.
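To make the idea of expected arguments concrete, here is a minimal sketch of a home-made function. The function add_increment and its arguments are hypothetical names invented for this example; the second argument has a default value, so it can be omitted.

```r
# 'add_increment' is a made-up function: it adds 'increment' to 'number'
add_increment <- function(number, increment = 1) {
  number + increment
}

print(add_increment(4))                   # uses the default increment: 5
print(add_increment(4, increment = 10))   # overrides the default: 14
```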
As previously described, you can store the result of any of the previously typed commands into a variable:
five <- as.numeric("4") + 1
print(five)
[1] 5
Please! Please! Give your variables explanatory names, so that their content is understandable by humans. I would be pissed off seeing any of you calling their variable “a”, “b”, “xyz”, “my_awsome_var”, “dunno_what_it_is”, “thing1”, “thing2”, etc …
Question :
I have two numbers stored in these variables :
- mysterious_number_7
- suspicious_number_7
When I apply the print() function on them, it returns 7 in both cases.
They are both numeric.
However, they are not equal …
Why ?
Code to check question facts
# Show the value of the variable mysterious_number_7
print(mysterious_number_7)
[1] 7
# Show the value of the variable suspicious_number_7
print(suspicious_number_7)
[1] 7
# Check that mysterious_number_7 is a number
is.numeric(mysterious_number_7)
[1] TRUE
# Check that suspicious_number_7 is a number
is.numeric(suspicious_number_7)
[1] TRUE
# Check whether the values of mysterious_number_7 and suspicious_number_7 are equal
mysterious_number_7 == suspicious_number_7
[1] FALSE
# Check whether the values of mysterious_number_7 and suspicious_number_7 are identical
identical(mysterious_number_7, suspicious_number_7)
[1] FALSE
We will talk about differences between equality and identity later.
Answer
This is due to the number of digits displayed in R. You are very likely to run into this issue in the future, like every (bio)informatician around the world.
mysterious_number_7 <- 7.0000001
suspicious_number_7 <- 7
You can change the number of displayed digits with the function options():
options(digits=8) # allows you to print up to 8 significant digits
print(mysterious_number_7)
[1] 7.0000001
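When two numbers should be treated as equal despite a tiny difference, a common approach is to compare them with a tolerance. The sketch below shows two ways of doing so; the tolerance value 1e-6 is an arbitrary choice made for this example.

```r
mysterious_number_7 <- 7.0000001
suspicious_number_7 <- 7

# Strict comparison: the two values differ
print(mysterious_number_7 == suspicious_number_7)   # FALSE

# Comparison with a tolerance chosen for this example
close_enough <- abs(mysterious_number_7 - suspicious_number_7) < 1e-6
print(close_enough)                                 # TRUE

# all.equal() does the same kind of test; wrap it in isTRUE()
print(isTRUE(all.equal(mysterious_number_7, suspicious_number_7,
                       tolerance = 1e-6)))
```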
Logical
Aside from characters and numbers, there is another very important type in R (and computer science in general): logicals, which correspond to the boolean type. There are two logical values : TRUE and FALSE.
3 > 4
[1] FALSE
5 < 10
[1] TRUE
5 == 10
[1] FALSE
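Logical values can also be combined with the operators & (and), | (or) and ! (not). A short sketch, with variable names made up for the example:

```r
is_small <- 3 < 4      # TRUE
is_large <- 3 > 100    # FALSE

print(is_small & is_large)   # FALSE: both must be TRUE
print(is_small | is_large)   # TRUE: at least one is TRUE
print(!is_small)             # FALSE: negation of TRUE
```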
Data structures
Until now, we have seen simple information stored into a variable. But we can create more complex structures in order to store several pieces of information into a single variable.
NOTE : In your usage of R, you may come across R objects. These correspond to a special type of complex objects, that can store multiple data of different types, but also obey rules that allow users to interact with them (for example, getters and setters, to get or set information from and to them).
Vector
You can make vectors and tables in R. Don’t panic, there will be no maths in this course : vectors are a one-dimensional series of values of the same type.
In R, vectors can be created with the c() function (“c” for “concatenate”) :
one2twenty <- c("1", "2", "3", "4", "10", "20")
print(one2twenty)
[1] "1" "2" "3" "4" "10" "20"
One can check if a variable is a vector :
is.vector(one2twenty)
[1] TRUE
As a one-dimensional object, it has a length :
length(one2twenty)
[1] 6
One can select an element of the vector using square brackets [ and ] holding the index of the desired element :
one2twenty[1] # select the first element
[1] "1"
NOTE : R indexes start at 1 (many other languages start at 0)
one2twenty[5] # select the fifth element
[1] "10"
One can select multiple elements of a vector with a range of indexes :
one2twenty[2:4] # select from second to fourth element
[1] "2" "3" "4"
One can combine both ways :
one2twenty[c(1,3:5)] # select the first and third to fifth elements
[1] "1" "3" "4" "10"
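Two other selection styles are worth knowing and can be sketched the same way: negative indexes, which exclude elements, and logical conditions, which keep the elements for which the condition is TRUE. The vector sample_values is made up for this example.

```r
sample_values <- c(10, 20, 30, 40, 50)   # made-up values

# A negative index removes the element at that position
without_first <- sample_values[-1]
print(without_first)                     # 20 30 40 50

# A logical condition keeps only the matching elements
above_25 <- sample_values[sample_values > 25]
print(above_25)                          # 30 40 50

# length() gives the position of the last element
last_value <- sample_values[length(sample_values)]
print(last_value)                        # 50
```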
Question 1: Is there a difference between these two vectors ?
c_vector <- c("1", "2", "3")
n_vector <- c( 1, 2, 3 )
Answer
There is a difference indeed: c_vector contains characters, n_vector contains numerics.
print(c_vector)
[1] "1" "2" "3"
print(n_vector)
[1] 1 2 3
print(is.numeric(c_vector))
[1] FALSE
print(is.numeric(n_vector))
[1] TRUE
identical(c_vector, n_vector)
[1] FALSE
You can always use the identical() function to test identity robustly and exactly.
You may have learned about the operator == for equality.
But it is not perfect, look at our example:
c_vector == n_vector
[1] TRUE TRUE TRUE
The operator == does not check data types: it converts both sides to a common type (here, character) before comparing their values.
Another example, mixing numerics and booleans:
1 == TRUE
[1] TRUE
identical(1, TRUE)
[1] FALSE
In computer science, there is a historical reason why booleans and integers are mixed.
More information about that
This is linked to the history of programming languages. Prior to the introduction of an actual boolean type (TRUE and FALSE), the integers 1 and 0 were the official representation for true and false values, similar to what is used in C89 (the first standardized version of the C programming language, in 1989).
To avoid unnecessarily breaking imperfect but working code, the new boolean type needed to work just like 0s and 1s. This goes beyond mere truth values and covers all integer operations. No one would recommend using a boolean result in a numeric context, nor would most people recommend testing equality to determine a truth value, but no one wanted to find out the hard way just how much existing code relies on it. Hence the decision to make TRUE and FALSE behave like 1 and 0, respectively. This is merely a historical artifact of the evolution of languages.
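In R, this heritage is visible as soon as logicals are used in arithmetic: TRUE counts as 1 and FALSE as 0. A short sketch, with a made-up vector of test results:

```r
test_results <- c(TRUE, FALSE, TRUE)   # made-up logical values

# Summing logicals counts the number of TRUE values
number_of_true <- sum(test_results)
print(number_of_true)                  # 2

# Explicit conversion to integers
print(as.integer(TRUE))                # 1
print(as.integer(FALSE))               # 0
```

Counting TRUE values with sum() is a frequent and accepted use of this behaviour.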
Question 2: Can I include both text and numbers in a vector ?
mixed_vector <- c(1, "2", 3)
Answer
YES and NO. The example shows that we technically can, but we end up with an unexpected result : all our values have been turned into characters, as we cannot mix types in a vector. Either all its content is made of numbers, or all its content is made of characters.
print(mixed_vector)
[1] "1" "2" "3"
print(is.numeric(mixed_vector))
[1] FALSE
print(is.character(mixed_vector))
[1] TRUE
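This automatic conversion always goes towards the most general type (logical, then numeric, then character). The sketch below illustrates it with class(), and shows how to convert the mixed vector back to numbers with as.numeric().

```r
# Logical and numeric mixed: everything becomes numeric
print(class(c(1, TRUE)))         # "numeric"

# As soon as one character is present: everything becomes character
print(class(c(1, "a", TRUE)))    # "character"

# Converting back works when every element looks like a number
mixed_vector <- c(1, "2", 3)
back_to_numbers <- as.numeric(mixed_vector)
print(back_to_numbers)           # 1 2 3
```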
Question 3: How to create a histogram from a vector ?
Help
A simple way to visualize your data is to use a graph. The hist() function may help you (of course, use the Help pane!!).
Answer
hist(c_vector)
Error in hist.default(c_vector) : 'x' must be numeric
Errr… Why is this command not working ?
The error says : “'x' must be numeric”. The function only accepts a vector made of numeric values.
hist(n_vector) # works perfectly !
Data Frame
In R, a table is stored as a data.frame. One way to create one from scratch is to use the data.frame() function :
one2three4 <- data.frame(col1 = c(1, 3), col2 = c(2, 4))
print(one2three4)
  col1 col2
1    1    2
2    3    4
By default, for data frames, R accepts names for columns and rows. You can set column and row names with the colnames() and rownames() functions, respectively.
colnames(one2three4) <- c("Col_1_3", "Col_2_4")
rownames(one2three4) <- c("Row_1_2", "Row_3_4")
print(one2three4)
        Col_1_3 Col_2_4
Row_1_2       1       2
Row_3_4       3       4
You can access a column and a row in the data frame using an index for each dimension between square brackets [ and ].
R expects first an index for rows, then for columns : data_frame[row_index, column_index]. You can use either the name of the row(s)/column(s) or their position index.
If one wants all values for one of the dimensions, one should just leave the bracket part empty.
# Select a row by its name
print(one2three4["Row_1_2", ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a row by its index
print(one2three4[1, ])
        Col_1_3 Col_2_4
Row_1_2       1       2
# Select a column by its name
print(one2three4[, "Col_1_3"])
[1] 1 3
# Select a column by its index
print(one2three4[, 1])
[1] 1 3
# Select a cell in the table
print(one2three4["Row_1_2", "Col_1_3"])
[1] 1
# Select the first two rows and the first column in the table
print(one2three4[1:2, 1])
[1] 1 3
If you like maths, you will recall the [row, column] order. If you’re not familiar with it, you will probably do like 99% of all software engineers : write in the wrong [column, row] order in your first interactions with R, and get an error as a result. Trust me. 99%, easy. And just remember that an error is never a dead end in computing.
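Besides the [row, column] syntax, columns of a data frame are often accessed by name with the $ operator or with double brackets. The sketch below uses a small made-up table; drop = FALSE is the standard argument that keeps the result as a data frame when a single column is selected.

```r
small_table <- data.frame(col1 = c(1, 3), col2 = c(2, 4))   # made-up table

# Access a column by its name with $
first_column <- small_table$col1
print(first_column)                     # 1 3

# Same idea with double square brackets
second_column <- small_table[["col2"]]
print(second_column)                    # 2 4

# Keep a one-column data frame instead of a plain vector
still_a_table <- small_table[, "col1", drop = FALSE]
print(is.data.frame(still_a_table))     # TRUE
```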
Question 1 : Can I mix characters and numbers in a data frame row ?
Answer
Indeed, it is possible !
mixed_data_frame <- data.frame(
"Character_Column" = c("a", "b", "c"),
"Number_Column" = c(4, 5, 6)
)
print(mixed_data_frame)
  Character_Column Number_Column
1                a             4
2                b             5
3                c             6
The str() function can be used to look at the type of each element in an object.
str(mixed_data_frame)
'data.frame': 3 obs. of 2 variables:
 $ Character_Column: chr "a" "b" "c"
 $ Number_Column   : num 4 5 6
str(one2three4)
'data.frame': 2 obs. of 2 variables:
 $ Col_1_3: num 1 3
 $ Col_2_4: num 2 4
Question 2 : Can I mix characters and numbers in a data frame column ?
Answer
Unfortunately, no, you can’t :
mixed_data_frame <- data.frame(
"Mixed_letters" = c(1, "b", "c"),
"Mixed_numbers" = c(4, "5", 6)
)
print(mixed_data_frame)
  Mixed_letters Mixed_numbers
1             1             4
2             b             5
3             c             6
str(mixed_data_frame)
'data.frame': 3 obs. of 2 variables:
 $ Mixed_letters: chr "1" "b" "c"
 $ Mixed_numbers: chr "4" "5" "6"
All this is because the data.frame is a special sort of two-dimensional object in R, structured by column : each column holds a single type, but different columns can hold different types.
Read a table as data frame
Exercise: Use the Help pane to find how to use the read.csv() function.
You can find the example_table.csv input file here. Download it by clicking on the [Download raw file] button.
Use the read.csv() function to:
- open the file example_table.csv.
- this table has a header (TRUE). A header is a title line that defines column names.
- this table has row names in the column called “Gene_id”.
- as a CSV file, its field separator is a comma ,.
Leave all other parameters at their default values.
Save the opened table in a variable called example_table.
Answer
example_table <- read.csv(file="example_table.csv",
header=TRUE,
row.names="Gene_id",
sep = ","
)
Now let us explore this dataset.
We can click on the ‘Environment’ pane:
And if you click on it :
Be careful ! Visualizing a large table may hang your session (by consuming too much RAM).
Alternatively, we can use the head() function, which prints the first lines of a table:
head(example_table)
Sample1 Sample2 Sample3 Sample4
Caml 9.998194 10.004116 9.172489 9.139667
Scamp5 9.995917 10.818685 11.417558 14.907892
Dgki 9.993974 13.664396 16.132275 17.420057
Mas1 9.993956 11.370854 11.233629 9.912863
Apba1 9.992540 14.253438 14.001228 13.654701
Phkg2 9.980898 8.748654 8.714821 9.146529
The summary() function describes the dataset per sample (per column) :
summary(example_table)
Sample1 Sample2 Sample3 Sample4
Min. : 9.9437 Min. : 6.8385 Min. : 5.5512 Min. : 5.8437
1st Qu.: 9.9526 1st Qu.: 9.0000 1st Qu.: 10.1196 1st Qu.: 9.7785
Median : 9.9710 Median : 10.9544 Median : 11.3256 Median : 11.9052
Mean :18.9372 Mean : 19.8355 Mean : 20.8277 Mean : 21.4123
3rd Qu.: 9.9940 3rd Qu.: 12.6467 3rd Qu.: 12.6499 3rd Qu.: 13.9677
Max. :99.7837 Max. :105.0774 Max. :112.1882 Max. :111.8205
Have a look at the summary() of the dataset per gene, using the t() function to transpose:
head(t(example_table))
Caml Scamp5 Dgki Mas1 Apba1 Phkg2 Timm8b
Sample1 9.998194 9.995917 9.993974 9.993956 9.992540 9.980898 99.78373
Sample2 10.004116 10.818685 13.664396 11.370854 14.253438 8.748654 105.07739
Sample3 9.172489 11.417558 16.132275 11.233629 14.001228 8.714821 112.18819
Sample4 9.139667 14.907892 17.420057 9.912863 13.654701 9.146529 109.09544
Capn7 Yrdc Coq10a Gm27000 Lrrc41 Acadsb Pdzd11
Sample1 9.976005 9.971093 9.970835 9.965511 9.960667 9.959179 9.952750
Sample2 11.314599 8.905508 8.820582 7.414795 9.961954 11.261520 9.031553
Sample3 11.452421 7.367243 10.449131 7.709008 10.435298 12.336088 10.700876
Sample4 11.692871 9.375526 10.865062 13.126211 9.137375 12.703318 10.832218
Smarca2 Gm26079 Ptpn5 Rexo2 Ifi27 Snhg20
Sample1 9.952224 99.514659 9.947524 9.946340 9.943989 9.943724
Sample2 9.272424 103.089626 11.090058 13.363912 12.407626 6.838499
Sample3 11.194709 109.856535 11.572261 11.477445 13.591186 5.551247
Sample4 12.117571 111.820504 10.255021 12.292877 14.906542 5.843670
summary(t(example_table))
Caml Scamp5 Dgki Mas1
Min. : 9.1397 Min. : 9.9959 Min. : 9.994 Min. : 9.9129
1st Qu.: 9.1643 1st Qu.:10.6130 1st Qu.:12.747 1st Qu.: 9.9737
Median : 9.5853 Median :11.1181 Median :14.898 Median :10.6138
Mean : 9.5786 Mean :11.7850 Mean :14.303 Mean :10.6278
3rd Qu.: 9.9997 3rd Qu.:12.2901 3rd Qu.:16.454 3rd Qu.:11.2679
Max. :10.0041 Max. :14.9079 Max. :17.420 Max. :11.3709
Apba1 Phkg2 Timm8b Capn7
Min. : 9.9925 Min. :8.7148 Min. : 99.784 Min. : 9.976
1st Qu.:12.7392 1st Qu.:8.7402 1st Qu.:103.754 1st Qu.:10.980
Median :13.8280 Median :8.9476 Median :107.086 Median :11.384
Mean :12.9755 Mean :9.1477 Mean :106.536 Mean :11.109
3rd Qu.:14.0643 3rd Qu.:9.3551 3rd Qu.:109.869 3rd Qu.:11.513
Max. :14.2534 Max. :9.9809 Max. :112.188 Max. :11.693
Yrdc Coq10a Gm27000 Lrrc41
Min. :7.3672 Min. : 8.8206 Min. : 7.4148 Min. : 9.1374
1st Qu.:8.5209 1st Qu.: 9.6833 1st Qu.: 7.6355 1st Qu.: 9.7548
Median :9.1405 Median :10.2100 Median : 8.8373 Median : 9.9613
Mean :8.9048 Mean :10.0264 Mean : 9.5539 Mean : 9.8738
3rd Qu.:9.5244 3rd Qu.:10.5531 3rd Qu.:10.7557 3rd Qu.:10.0803
Max. :9.9711 Max. :10.8651 Max. :13.1262 Max. :10.4353
Acadsb Pdzd11 Smarca2 Gm26079
Min. : 9.9592 Min. : 9.0316 Min. : 9.2724 Min. : 99.515
1st Qu.:10.9359 1st Qu.: 9.7225 1st Qu.: 9.7823 1st Qu.:102.196
Median :11.7988 Median :10.3268 Median :10.5735 Median :106.473
Mean :11.5650 Mean :10.1293 Mean :10.6342 Mean :106.070
3rd Qu.:12.4279 3rd Qu.:10.7337 3rd Qu.:11.4254 3rd Qu.:110.348
Max. :12.7033 Max. :10.8322 Max. :12.1176 Max. :111.821
Ptpn5 Rexo2 Ifi27 Snhg20
Min. : 9.9475 Min. : 9.9463 Min. : 9.944 Min. :5.5512
1st Qu.:10.1781 1st Qu.:11.0947 1st Qu.:11.792 1st Qu.:5.7706
Median :10.6725 Median :11.8852 Median :12.999 Median :6.3411
Mean :10.7162 Mean :11.7701 Mean :12.712 Mean :7.0443
3rd Qu.:11.2106 3rd Qu.:12.5606 3rd Qu.:13.920 3rd Qu.:7.6148
Max. :11.5723 Max. :13.3639 Max. :14.907 Max. :9.9437
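If only one statistic per gene or per sample is needed, the base R functions rowMeans() and colMeans() avoid transposing the table. The sketch below uses a tiny made-up expression table (3 genes, 2 samples) rather than the course file, so the names and values are assumptions.

```r
# Made-up expression values: genes in rows, samples in columns
toy_expression <- data.frame(
  Sample1 = c(1, 2, 3),
  Sample2 = c(3, 4, 5),
  row.names = c("GeneA", "GeneB", "GeneC")
)

# Mean per gene (one value per row)
gene_means <- rowMeans(toy_expression)
print(gene_means)      # GeneA 2, GeneB 3, GeneC 4

# Mean per sample (one value per column)
sample_means <- colMeans(toy_expression)
print(sample_means)    # Sample1 2, Sample2 4
```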
To go further
# number of columns
ncol(example_table)
[1] 4
# number of rows
nrow(example_table)
[1] 20
# get dimensions (number of rows and number of columns)
dim(example_table)
[1] 20 4
# type of each column
str(example_table)
'data.frame': 20 obs. of 4 variables:
$ Sample1: num 10 10 9.99 9.99 9.99 ...
$ Sample2: num 10 10.8 13.7 11.4 14.3 ...
$ Sample3: num 9.17 11.42 16.13 11.23 14 ...
$ Sample4: num 9.14 14.91 17.42 9.91 13.65 ...
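To practice reading a table without downloading anything, the round trip below writes a tiny made-up table to a temporary CSV file with write.csv(), then reads it back with the same read.csv() arguments as in the exercise. The table content and file name are assumptions made for this example.

```r
# Made-up table with a gene identifier column
toy_table <- data.frame(
  Gene_id = c("GeneA", "GeneB"),
  Sample1 = c(1.5, 2.5),
  stringsAsFactors = FALSE
)

# Write it as CSV in a temporary directory (no row numbers in the file)
toy_path <- file.path(tempdir(), "toy_table.csv")
write.csv(toy_table, file = toy_path, row.names = FALSE)

# Read it back, using the Gene_id column as row names
reloaded_table <- read.csv(file = toy_path,
                           header = TRUE,
                           row.names = "Gene_id",
                           sep = ",")
print(reloaded_table)
print(dim(reloaded_table))   # 2 rows, 1 column
```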
TLDR – Too Long Didn’t Read !
# Declare a variable, and store a value inside :
three <- 3
# Basic maths operators: + - / * work as intended:
six <- 3 + 3
# Quotes are used to delimit characters:
seven <- "7"
# You cannot perform maths on characters :
"7" + 8 # raises an error
seven + 8 # also raises an error
six + 8 # works fine
# R does its best to help you. You can change the type of a value with:
as.numeric("4") # the character "4" becomes the number 4
as.character(10) # the number 10 becomes the character "10"
# You can compare values; the result is a logical :
six < seven
six + 1 >= seven
identical(example_table, mixed_data_frame)
# You can load a data.frame from a text file, or save it to one :
read.table(file = ..., sep = ..., header = TRUE)
write.table(x = ..., file = ...)
# Create a table with:
my_table <- data.frame(...)
# Create a vector with:
my_vector <- c(...)
# You can read the first lines of a data.frame :
head(example_table)
# Search for help in the 'Help' pane or with:
help(...)
R – Packages
What are modules and packages
Modules and packages are considered to be the same thing in this lesson. The difference is technical and outside our scope.
Most of the work you are likely to do with R will require one or several additional packages. A package is a collection of functions, pipelines, or datasets shipped under a given name. In general, a package groups together functions linked to an analysis scheme or theme, towards a defined aim. Every single function you use in R comes from one package or another. Those used so far in this lesson come mostly from the ‘base’ package (R is shipped with a short list of mandatory packages).
Invoke the help page for the print function, and read the very first line of the ‘print’ help page :
help(print)
It reads: print {base} : the function print comes from the package base.
WARNING : Sometimes, two packages may use the same name for different functions ! They are most certainly not doing the exact same thing. IMHO, it is a good habit to ALWAYS call a function while disambiguating the package name, using this syntax : package::function. Writing base::print() is better and clearer than using print() alone.
# Call the function print(), with the argument "You're the best!"
print("You're the best!")
# Call the function print() from the package base, with the argument "You're the best!"
base::print("You're the best!")
Install a package
Your work will probably require the installation of a new package from an external source.
R can install packages from multiple different sources and repositories, but by default, it installs from the CRAN repository.
Use install.packages()
to install a package.
# Install a package with the following function
install.packages("dplyr")
This will prompt you with simple questions : from which mirror site to download (choose somewhere in France), whether to update other packages or not, etc.
Do not be afraid of the large amount of text printed in the console and let R do the trick. These are messages from the different steps required to safely install a package and test that it is properly installed and usable. It may also take time, especially when the requested package requires other packages (dependencies), that may themselves require other packages (etc…). Almost any code relies on other code !
Alternatively, you can click on [Tools] -> [Install Packages…] in RStudio; or click on the [Install] button in the ‘Packages’ tab of the ‘Files/Help’ pane.
You can list installed packages with installed.packages(),
and find packages that can be updated with old.packages().
These packages can then be updated with update.packages().
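As a minimal sketch of how installed.packages() can be used in a script, the two helpers below check whether a package is already installed before trying to install it. The names is_installed and install_if_missing are hypothetical helpers written for this example, they are not part of R :

```r
# Hypothetical helper: TRUE if the package is installed locally.
# installed.packages() returns a matrix whose row names are package names.
is_installed <- function(package_name) {
  package_name %in% rownames(installed.packages())
}

# Hypothetical helper: install from the default repository only when missing
install_if_missing <- function(package_name) {
  if (!is_installed(package_name)) {
    install.packages(package_name)
  }
  invisible(is_installed(package_name))
}

# 'stats' is shipped with R, so nothing is downloaded here
print(is_installed("stats"))
install_if_missing("stats")
```

This avoids reinstalling a package every time the script is run.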
While the install.packages()
function searches for packages in the common R package list, many
bioinformatics-related packages are available in other package
repositories. Just like the AppStore and the PlayStore do not offer the same
applications on your phone, R has multiple sources for its packages.
As someone invested in biology who wants to use R, in addition to the
default CRAN repository, you need to know another one : Bioconductor.
You can install a package from Bioconductor with the BiocManager::install()
function :
# Install BiocManager, a package to use Bioconductor, from CRAN
install.packages("BiocManager")
# Install a package from Bioconductor
BiocManager::install("DESeq2")
Fortunately, Bioconductor knows that CRAN exists, so when one requests the installation of a Bioconductor package that depends on CRAN packages, the installer will get them without any human intervention.
Use a package
An installed package is not actually active by default (you can’t use its functions just because it is installed). When you want to use something from an installed package, you need to invoke it first.
You can load a package with the library() function :
library(package="dplyr") # or just library(dplyr)
If the requested package is not locally installed, this will raise an error.
If there is no error message, it implies that the package is loaded.
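The error raised by library() can also be caught in a script, so that it stops with a clearer message. This is only a sketch : the package name below is hypothetical and is assumed NOT to be installed on your machine.

```r
# Hypothetical package name, assumed NOT to be installed locally
missing_package <- "aPackageThatIsNotInstalled"

load_status <- tryCatch(
  {
    # character.only = TRUE tells library() that the name is given as a string
    library(missing_package, character.only = TRUE)
    "loaded"
  },
  error = function(e) {
    # conditionMessage() extracts the text of the error raised by library()
    paste("could not load:", conditionMessage(e))
  }
)

print(load_status)
```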
When a package is loaded, all its content (functions, objects, datasets, …) becomes accessible by name, without the package:: prefix.
NOTE 1 : when a package is loaded, its box in the ‘Packages’ pane is ticked.
NOTE 2 : There is an exception to the package invocation requirement : when one calls a function with the disambiguation syntax package::function(), R loads the package in the background without attaching it, so the requested function runs but the other functions of the package remain unavailable by their bare name.
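A small sketch of what happens with the package::function() syntax, using the ‘tools’ package, which is shipped with R but is usually not attached when a session starts (this is an assumption about your session, and the file name is made up for the example) :

```r
# Is 'tools' attached (visible in the search path) before the call ?
attached_before <- "package:tools" %in% search()

# Call a single function with the disambiguation syntax, without library()
extension <- tools::file_ext("counts_table.csv")

# The namespace of 'tools' is now loaded ...
namespace_loaded <- isNamespaceLoaded("tools")

# ... but the call did not attach the package to the search path
attached_after <- "package:tools" %in% search()

print(extension)
print(c(attached_before, namespace_loaded, attached_after))
```

In other words, the :: syntax gives access to one function at a time, while library() makes the whole package available by name.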
Then you can try :
help(topic="arrange", package="dplyr")
And search the help page for how to run your command.
Alternatively, there is a more complete help page for the package,
reached with the browseVignettes()
function. It opens in your browser automatically (or in the RStudio
browser, depending on your configuration), and if you click on “HTML”,
you get some information about the package like functions, tutorials,
etc.
NOTE : However, such a vignette is optional in an R package : its presence only depends on the will of the package author(s).
browseVignettes(package="dplyr")
TLDR – Too Long Didn’t Read
# Install a CRAN package
install.packages("BiocManager")
# Load a package
library("BiocManager")
# Install a BioConductor package
BiocManager::install("DESeq2")
# Get help
browseVignettes(package="DESeq2")
Tips for your project
Write a good script
Good practices (for your sake, and that of any future reader of your code) :
- write documentation (for example, a header at the start of the script which explains the purpose of the script, the parameters you set up, and the analysis steps)
- use comments (uninterpreted lines, starting with #)
- use code indentation (spaces before a code line that show its place in the hierarchical structure)
- use human-readable variable names
- do not nest too many functions inside each other, this quickly becomes a mess
### BAD : difficult to understand
print(rowMeans(data.frame(c(9, 14, 17, 9, 13),
c(11, 10, 20, 7, 17),c(15, 8, 19, 10, 15) )) )
### GOOD : easy to understand
## Goal: this script computes the mean of the expression of our 3 samples for each gene:
#create a dataframe with the genes expression of our 3 samples:
example_data_frame <- data.frame("Expression_Sample_1" = c(9, 14, 17, 9, 13),
                                 "Expression_Sample_2" = c(11, 10, 20, 7, 17),
                                 "Expression_Sample_3" = c(15, 8, 19, 10, 15)
                                 )
#add corresponding genes names into row names:
rownames(example_data_frame) <- c("Caml", "Scamp5", "Dgki", "Mas1", "Apba1")
#compute the mean of the expression for each gene:
mean_expression_Samples123 <- rowMeans(example_data_frame)
#print the result:
print(mean_expression_Samples123)
- save your script frequently, as well as your working environment
- save the versions of the loaded packages at the end of your analysis (you can print loaded packages thanks to the sessionInfo() function and save the result into an on-disk file thanks to the capture.output() function).
sessionInfo() #displays name and version of loaded packages in the console
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] C
time zone: Europe/Paris
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] digest_0.6.35 R6_2.5.1 bookdown_0.39
[4] fastmap_1.1.1 xfun_0.43 cachem_1.0.8
[7] knitr_1.46 htmltools_0.5.8.1 rmarkdown_2.26
[10] lifecycle_1.0.4 cli_3.6.2 rmdformatsbigr_1.0.0
[13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.0
[16] highr_0.10 tools_4.4.0 evaluate_0.23
[19] bslib_0.7.0 yaml_2.3.8 jsonlite_1.8.8
[22] rlang_1.1.3
utils::capture.output(sessionInfo(), file = "sessionInfo.txt") #save them in a file
Load and save R objects
While working on your projects, you will process datasets in R. The results of these analyses will be stored in variables. All these are stored in memory, which is volatile. This means, when you close RStudio, any unsaved object will be lost.
We already described the save.image() function, which saves a copy of your complete working environment.
However, you can also save the content of a single variable. This is useful when you want to save the result of a function (or a pipeline) but not the whole 5 hours of trial-and-error work you’ve been spending on how-to-make-that-bloody-pipeline-work-correctly.
The (compressed) format used to save a single object is called RDS, for R Data Serialization. Saving is done with the saveRDS() function :
saveRDS(object = example_table, file = "example_table.RDS")
Fortunately, you can load the content of an RDS file back into a variable ! This is useful when you receive an RDS file from a coworker, or when you would like to resume your work from a saved point. This is done with the readRDS() function :
example_table <- readRDS(file = "example_table.RDS")
NOTE : You can see in the example above that loading an object saved as an RDS file requires assigning it to a variable. This means that the RDS file contains the object, but unnamed. This is not the case when saving objects with save() or a whole environment with save.image() : the objects are saved together with their names, and get these names back when loaded.
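The difference between the two approaches can be sketched as follows. The data frame and the file paths are made up for this example (temporary files are used so that nothing is written in your project folder) :

```r
# A small example object (values made up for the illustration)
scores <- data.frame(gene = c("Caml", "Dgki"), expression = c(9, 17))

# saveRDS() / readRDS(): the object is stored without its name,
# so it can be read back into a variable with any name
rds_path <- tempfile(fileext = ".RDS")
saveRDS(object = scores, file = rds_path)
restored_scores <- readRDS(file = rds_path)
print(identical(scores, restored_scores))

# save() / load(): the object is stored together with its name,
# and load() recreates it under that same name
rdata_path <- tempfile(fileext = ".RData")
save(scores, file = rdata_path)
rm(scores)
loaded_names <- load(rdata_path)
print(loaded_names)
```

Note that load() returns the names of the objects it restored, which is handy to find out what a file received from someone else contains.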
Human data
WARNING : If you hold human-related genomic datasets that can allow person identification (i.e., sequences), you cannot use/upload these data anywhere without control. This is strictly illegal, and such behavior may earn you up to 5 years in jail, along with a 300 000€ fine (Art. 226-16, Section 5, Code pénal).
Packages update
It is good practice to keep package versions unchanged within a work project. If you update a package (whether out of need or by choice), then you should rerun your analysis from the start. This holds as long as you’re not 100% sure that the update does not affect your results.
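One possible way to notice an unwanted update is to record the version of a package when the analysis is first run, and to compare it later. This is a sketch only : it uses the ‘stats’ package because it is shipped with R, and the variable names are chosen for the example.

```r
# Version of the 'stats' package recorded when the analysis was first run
# (here we simply record the current one, for the sake of the example)
recorded_version <- as.character(utils::packageVersion("stats"))

# Later on: compare the installed version with the recorded one
current_version <- as.character(utils::packageVersion("stats"))
version_unchanged <- identical(current_version, recorded_version)

if (!version_unchanged) {
  warning("The 'stats' package was updated: rerun the analysis from the start.")
}

print(version_unchanged)
```

In a real project, the recorded version would rather be read from a file saved at the end of a previous analysis.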
The Swirl package
What if I would like to keep learning R by myself, whenever I can or want ?
What is swirl?
swirl is an R
package that teaches you R programming and data science interactively,
at your own pace, and straight into the R console.
It presents a choice of course lessons and interactively tutors a student through them, with multiple levels of complexity. A student may be asked to watch a video, to answer a multiple-choice or fill-in-the-blanks question, or to enter a command in the R console precisely as if he or she were using R in practice. Emphasis is on the last, interacting with the R console. User responses are tested for correctness and hints are given if appropriate.
Progress is automatically saved so that one may quit at any time and later resume without losing anything.
Installation and usage
#install package
install.packages("swirl")
#load package
library(swirl)
#install the "R Programming" course
install_course("R Programming")
#start the course
swirl()
Enjoy!
Other useful command lines for swirl usage
#quit swirl
bye()
#skip a question
skip()
#return to the main menu
main()
#allow experimentation in the R console without interference from swirl
play()
#to resume interacting with swirl
nxt()
#display a help menu
info()
Conclusion
No programming language is better than any other. They all serve different purposes. Anyone claiming the opposite is (over)-specialized in the language they are advertising (and probably strongly lacks objectivity).
In the field of bioinformatics, the languages used by the community are quite limited. The main, widely adopted options are bash, R, and Python.
While learning bash cannot be avoided nowadays if you need to interact with an HPC cluster, it is not enough to perform a complete analysis with publication-ready figures and shaped results. You should take an interest in another programming language: R and/or Python. R allows you to do a lot of different analyses, and it has a large user community with lots of online resources for help. As such, it’s one of the easiest languages for beginners.
Please note that this advice is valid today, but may change. Many other programming languages are used, some have lost their place on the podium, and others are trying to supersede bash, R, and Python. An example is Julia